Efficient Algorithms for Locating the Length-Constrained Heaviest Segments, with Applications to Biomolecular Sequence Analysis

Authors

  • Yaw-Ling Lin
  • Tao Jiang
  • Kun-Mao Chao
Abstract

We study two fundamental problems concerning the search for interesting regions in sequences: (i) given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum and (ii) given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. We present an O(n)-time algorithm for the first problem and an O(n log L)-time algorithm for the second. The algorithms have potential applications in several areas of biomolecular sequence analysis including locating GC-rich regions in a genomic DNA sequence, post-processing sequence alignments, annotating multiple sequence alignments, and computing length-constrained ungapped local alignment. Our preliminary tests on both simulated and real data demonstrate that the algorithms are very efficient and able to locate useful (such as GC-rich) regions.
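Problem (i) can be illustrated with a short sketch. The code below is not the algorithm from the paper; it swaps in a standard prefix-sum technique with a monotonic deque, which also runs in O(n) time. The function name max_sum_segment and the half-open (start, end) return convention are assumptions made for this illustration.

```python
from collections import deque
from typing import List, Tuple


def max_sum_segment(a: List[float], U: int) -> Tuple[float, int, int]:
    """Return (best_sum, start, end) for the non-empty segment a[start:end]
    of length at most U with the maximum sum.

    Illustrative sketch (prefix sums + monotonic deque), not the paper's method.
    """
    if not a or U < 1:
        raise ValueError("need a non-empty sequence and U >= 1")
    n = len(a)

    # prefix[i] = a[0] + ... + a[i-1], so sum(a[i:j]) = prefix[j] - prefix[i].
    prefix = [0.0] * (n + 1)
    for i, x in enumerate(a):
        prefix[i + 1] = prefix[i] + x

    best = (float("-inf"), 0, 0)
    window = deque()  # candidate start indices; their prefix values increase
    for j in range(1, n + 1):
        i = j - 1  # newest admissible start for a segment ending at j
        # A start with a larger-or-equal prefix value can never beat i later.
        while window and prefix[window[-1]] >= prefix[i]:
            window.pop()
        window.append(i)
        # Drop starts that would make the segment longer than U.
        while window[0] < j - U:
            window.popleft()
        s = prefix[j] - prefix[window[0]]
        if s > best[0]:
            best = (s, window[0], j)
    return best
```

Each index enters and leaves the deque at most once, which is where the linear running time of this sketch comes from. For example, max_sum_segment([2, -3, 4, -1, 2, -5, 3], 3) selects the segment 4, -1, 2.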

Similar resources

Definitions and Algorithms in SEGID

Given a (multiple) sequence alignment, SEGID first converts it into a sequence of numbers, where each number is the alignment score of a column. (SEGID also directly accepts a sequence of numbers as input.) Then it provides three algorithms to identify conserved segments (high score substrings): 1. Longest segment (with average value lower bound): given a string of numbers and a number A, find ...

Algorithms for the Problems of Length-Constrained Heaviest Segments

We present algorithms for length-constrained maximum sum segment and maximum density segment problems, in particular, and the problem of finding length-constrained heaviest segments, in general, for a sequence of real numbers. Given a sequence of n real numbers and two real parameters L and U (L ≤ U), the maximum sum segment problem is to find a consecutive subsequence, called a segment, of len...

Genomic Sequence Analysis: A Case Study in Constrained Heaviest Segments (Working draft)

Methods for genomic sequence analysis have been studied for more than a decade. One line of investigation is to locate the biologically meaningful segments, like conserved regions or GC-rich regions in DNA sequences. A common approach is to assign a real number (also called scores) to each residue, and then look for the maximum-sum or maximum-average segment. In this chapter, we address a few i...
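The scoring approach described above can be sketched as follows, under stated assumptions. The scoring scheme (1 for G or C, 0 otherwise) and the names gc_scores and max_average_segment are hypothetical choices for this illustration. The search is a simple O(nL) baseline, not one of the faster published algorithms: it relies on the fact that a segment of length 2L or more can be split into two pieces of length at least L, one of which has an average at least as large, so only lengths L to 2L-1 need to be tried.

```python
from typing import List, Sequence, Tuple


def gc_scores(dna: str) -> List[int]:
    """Score each residue: 1 for G or C, 0 otherwise (illustrative scheme)."""
    return [1 if base in "GCgc" else 0 for base in dna]


def max_average_segment(a: Sequence[float], L: int) -> Tuple[float, int, int]:
    """Return (best_avg, start, end) for the segment a[start:end] of length
    at least L with the maximum average.

    Baseline sketch: tries only lengths L..2L-1, which is enough because a
    longer segment splits into two pieces of length >= L, and one of them
    has an average at least as large as the whole.
    """
    n = len(a)
    if L < 1 or L > n:
        raise ValueError("need 1 <= L <= len(a)")

    # prefix[i] = a[0] + ... + a[i-1], so sum(a[i:j]) = prefix[j] - prefix[i].
    prefix = [0.0] * (n + 1)
    for i, x in enumerate(a):
        prefix[i + 1] = prefix[i] + x

    best = (float("-inf"), 0, 0)
    for start in range(n - L + 1):
        last_end = min(n, start + 2 * L - 1)
        for end in range(start + L, last_end + 1):
            avg = (prefix[end] - prefix[start]) / (end - start)
            if avg > best[0]:  # strict comparison keeps the first optimum found
                best = (avg, start, end)
    return best
```

For example, max_average_segment(gc_scores("ATGCGCAT"), 4) picks out the GCGC stretch in the middle of the string.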

MAVG: locating non-overlapping maximum average segments in a given sequence

Summary: MAVG is a software tool for finding k non-overlapping maximum-average segments that are sufficiently long in a given sequence of real numbers, for any k > 0. It has applications in several areas of biomolecular sequence analysis including locating GC-rich regions and CpG islands in a genomic sequence, and annotating multiple sequence alignments. Availability: http://iubio.bio.indiana.e...

Publication date: 2002